ANALHITZA: a tool to extract linguistic information from large corpora in Humanities research
Abstract
In some areas of research, corpora remain small because there are no tools for processing the language under study easily and at scale. In this article, we present ANALHITZA, a tool being developed within the Clarink project, which aims to create linguistic technologies useful for research in the Social Sciences and Humanities. ANALHITZA is designed to make it easy to extract linguistic information online from large corpora. It is also multilingual: it can process texts written in Basque, Spanish and English. We further present three real case studies in which ANALHITZA has been used. The tool can be redesigned or modified according to the needs of the scientific community in the Humanities.
Similar resources
Lexical Bundles in English Abstracts of Research Articles Written by Iranian Scholars: Examples from Humanities
This paper investigates a special type of recurrent expressions, lexical bundles, defined as a sequence of three or more words that co-occur frequently in a particular register (Biber et al., 1999). Considering the importance of this group of multi-word sequences in academic prose, this study explores the forms and syntactic structures of three- and four-word bundles in English abstracts writte...
Annotating Corpora from Various Sources in the Humanities Domain
Voula Giouli. Annotating corpora from various sources in the humanities domain: shortcomings and issues. In this paper, we present work aimed at the linguistic annotation of Greek corpora that belong to the humanities domain, the focus being on the methodological principles as well as the implementation framework adopted. This framework builds on an existin...
Matrix: a statistical method and software tool for linguistic analysis through corpus comparison
Matrix: A statistical method and software tool for linguistic analysis through corpus comparison. A thesis submitted to Lancaster University for the degree of Ph.D. in Computer Science. Paul Edward Rayson, B.Sc., September 2002. This thesis reports the development of a new kind of method and tool (Matrix) for advancing the statistical analysis of electronic corpora of linguistic data. First, we des...
Draft WebCorp: providing a renewable data source for corpus linguists
The many electronic text corpora available nowadays present ever fewer obstacles to a wide range of corpus linguistic study. However, corpora are expensive resources to create and to update, and there remain problems for linguists if they seek access to very large, very recent, or changing language. The World Wide Web, whilst intended as an information source, is an obvious resource for the ret...
Towards the Spatial Analysis of Vague and Imaginary Places: Evolving the Spatial Humanities through Medieval Romance
The establishment of the field of Spatial Humanities testifies to the success in the use of technologies such as Geographic Information Systems (GIS) for the analysis of texts in Humanities. Although the increasing volume of projects can be regarded as a sign of advance, an important challenge has remained unsolved in this field and it has been barely addressed. The majority of research dealing...
Journal title:
- Procesamiento del Lenguaje Natural
Volume 58
Publication date: 2017